Dirty Data, Bad Predictions, and the Ethics of Crime Forecasting
Opening Thought
Before we discuss HOW to build predictive policing systems, we need to ask: SHOULD we?
“A statistically ‘good’ model can still be socially harmful.”
Richardson, Schultz & Crawford (2019)
Today’s Critical Questions
Technical Questions:
How do we model crime counts?
What spatial features predict crime?
How do we validate predictions?
Can we outperform baseline methods?
Critical Questions:
Whose data? Whose crimes?
What if the data is “dirty”?
Who benefits? Who is harmed?
What feedback loops are created?
Can technical solutions fix social problems?
Important: Today’s Approach
We will learn the technical methods AND critically interrogate their use. Both skills are essential for ethical data science.
Part 1: The Seductive Promise of Predictive Policing
The Sales Pitch
What vendors and police departments claim:
Efficiency: “Deploy limited resources where they’re needed most”
Objectivity: “Remove human bias from decision-making”
Proactivity: “Prevent crime before it happens”
Data-driven: “Let the data tell us where crime will occur”
Sounds great, right?
But these claims rest on critical assumptions:
That crime data accurately reflects crime (it doesn’t)
That past patterns predict future crime (they might just predict policing)
That we can separate “good” from “bad” data (we often can’t)
That technical solutions can fix social problems (they can’t)
The Technical Evolution
| Generation | Method | Data Used | Example |
|---|---|---|---|
| 1st: Hotspots | Kernel Density | Past crime locations | KDE maps |
| 2nd: Risk Terrain | Logistic Reg. | Crime + features | RTM software |
| 3rd: ML | Random Forest, Neural Nets | Everything | PredPol, Palantir |
| 4th: Person-Based | Network analysis | Social connections | Strategic Subject List |
Each generation claims to be more “objective” and “accurate”
But what if they’re all built on the same flawed foundation?
Part 2: The Dirty Data Problem
What Is “Dirty Data”?
Traditional definition (data mining):
Missing data
Incorrect data
Non-standardized formats
Extended definition (Richardson et al. 2019):
“Data derived from or influenced by corrupt, biased, and unlawful practices, including data that has been intentionally manipulated or ‘juked,’ as well as data that is distorted by individual and societal biases.”
The Many Forms of Dirty Data
1. Fabricated/Manipulated Data
False arrests (planted evidence)
Downgraded crime classifications to “juke the stats”
Pressuring victims not to file reports
2. Systematically Biased Data
Over-policing of certain communities → more recorded “crime”
Under-policing of white-collar crime → appears less common
Racial profiling → disproportionate stops/arrests
3. Missing/Incomplete Data
Unreported crimes (especially in over-policed areas with low police trust)
Ignored complaints
Incomplete records
4. Proxy Problems
Arrests ≠ crimes committed
Calls for service ≠ actual need
Gang database ≠ actual gang membership
Case Study 1: “Juking the Stats” - The Wire and Baltimore
From TV to Reality:
The Wire (2004): “If the crime rate doesn’t fall, you most certainly will”
Baltimore Reality (2008-2018):
14,000+ serious assaults misrecorded as minor offenses
Extensive Gun Trace Task Force corruption
Officers robbing residents, planting evidence
False arrests, fabricated reports
Data manipulation to show “success”
Result: 55+ potential lawsuits, thousands of convictions questioned
Question: What happens when this data trains a predictive algorithm?
Case Study 2: CompStat and NYPD
The Promise: Data-driven accountability, crime reduction
The Reality Revealed:
100+ retired NYPD captains surveyed: Intense pressure led to stat manipulation
Serious crimes downgraded to meet targets
Officers planting drugs to meet arrest quotas
Commanders persuading victims not to file reports
The Dual Strategy:
Downgrade serious crimes (reported to FBI) → claim success
Increase minor arrests (stops, summonses) → show “control”
2013: Independent audit confirmed systematic data problems
The Feedback Loop Diagram
The Confirmation Bias Loop:
Algorithm learns: “Crime happens in neighborhood X”
Police sent to neighborhood X
More arrests in neighborhood X (regardless of actual crime)
Algorithm “confirmed”: “We were right about neighborhood X!”
Cycle intensifies
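The loop above can be sketched as a toy simulation. Everything here is an illustrative assumption, not real data: areas A and B have identical true crime, the initial patrol allocation is skewed by biased historical records, and recorded incidents grow faster than proportionally with patrol presence.

```r
# Toy sketch of the confirmation bias loop (all numbers are illustrative).
# Assumptions: areas A and B have IDENTICAL true crime; patrols follow past
# recorded incidents; recording grows faster than patrol presence.
true_crime <- c(A = 5, B = 5)
patrol     <- c(A = 0.6, B = 0.4)  # initial allocation from biased historical data

share_A <- numeric(10)
for (t in 1:10) {
  recorded <- true_crime * patrol^2     # over-patrolled areas log disproportionately more
  patrol   <- recorded / sum(recorded)  # next allocation "follows the data"
  share_A[t] <- patrol["A"]
}

share_A  # patrol share in A climbs toward 1, despite equal true crime
```

Under these assumptions the small initial imbalance is never corrected; it compounds until nearly all patrols go to A, exactly the "cycle intensifies" step.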
Part 3: Technical Fixes Can’t Solve Social Problems
Vendor Claims About Bias Mitigation
PredPol claims:
“Uses ONLY 3 data points—crime type, crime location, and crime date/time”
HunchLab claims:
“We would not use data that relates people to predict places—no arrests, no social media, no gang status”
Both exclude: Arrest data, stop data, traffic stops
Both include: Crime reports, calls for service
Why “Cleaning” The Data Isn’t Enough
Problem 1: Crime reports reflect police decisions
Officer decides what to investigate
Officer decides what to classify as “crime”
Officer decides what to document
Problem 2: Calls for service reflect community bias
Neighbors calling police on Black people barbecuing
“Suspicious activity” = person of color in “wrong” neighborhood
Gentrification → increased 311 calls on existing residents
Problem 3: What counts as “clean” data?
If drug arrests are racially biased, exclude them ✓
But isn’t burglary enforcement also biased? What about assault?
Where do you draw the line?
The Impossibility of Neutral Crime Data
Crime data is ALWAYS:
Socially constructed - Societies define what counts as “crime”
Selectively enforced - More resources to some neighborhoods
Organizationally filtered - Police priorities, department culture
Politically shaped - “Tough on crime” eras, moral panics
Why Not OLS for Crime Counts?
Linear regression is a poor fit for count outcomes:
Can predict negative values (impossible for counts)
Assumes constant variance (counts often have variance ≠ mean)
Assumes a continuous outcome (counts are discrete)
Assumes normal errors (count data are right-skewed)
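A tiny hypothetical example makes the first point concrete: OLS will happily fit negative "counts", while a Poisson model with a log link cannot.

```r
# Hypothetical five-cell data: counts rise steeply with a covariate x
x <- 1:5
y <- c(0, 0, 1, 2, 10)

fit_ols <- lm(y ~ x)
fitted(fit_ols)[1]         # about -1.8: a negative predicted count

fit_pois <- glm(y ~ x, family = poisson(link = "log"))
all(fitted(fit_pois) > 0)  # TRUE: exp() keeps every prediction positive
```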
Distribution of Crime Counts
```r
# Typical pattern for crime data
ggplot(fishnet, aes(x = countBurglaries)) +
  geom_histogram(binwidth = 1, fill = "#440154FF", color = "white") +
  labs(
    title = "Distribution of Burglary Counts",
    subtitle = "Most cells have 0-2 burglaries, few have many",
    x = "Burglaries per Cell",
    y = "Number of Cells"
  ) +
  theme_minimal()
```
The Poisson Model
\[\log(\lambda_i) = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}\]
where \(\lambda_i\) is the expected count in cell \(i\). The log link:
Ensures \(\lambda_i > 0\) (counts can’t be negative)
Gives a linear relationship on the log scale
Gives multiplicative effects on the count scale
Interpreting Poisson Coefficients
On log scale:
\(\beta_1\) = change in log(expected count) per unit increase in \(X_1\)
On count scale (exponentiate):
\(\exp(\beta_1)\) = multiplicative effect on expected count
Examples:
| \(\beta\) | \(\exp(\beta)\) | Interpretation |
|---|---|---|
| 0.14 | 1.15 | 15% increase per unit of X |
| -0.22 | 0.80 | 20% decrease per unit of X |
| 0.00 | 1.00 | No effect |
| 0.69 | 2.00 | Doubling per unit of X |
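The table’s arithmetic can be checked directly in base R (the \(\beta\) values are the table’s own, not from any fitted model):

```r
# Checking the exp(beta) column from the table above
exp(0.14)   # ≈ 1.15 → 15% increase per unit of X
exp(-0.22)  # ≈ 0.80 → 20% decrease per unit of X
exp(0.00)   # = 1.00 → no effect
exp(0.69)   # ≈ 1.99 → roughly doubling, since log(2) ≈ 0.693
```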
Poisson Regression in R
```r
# Fit Poisson model
model_poisson <- glm(
  countBurglaries ~ Abandoned_Cars + Abandoned_Cars.nn + abandoned.isSig.dist,
  data = fishnet,
  family = poisson(link = "log")
)

# View results
summary(model_poisson)

# Exponentiate coefficients for interpretation
exp(coef(model_poisson))

# Example output:
#                      exp(coef)
# (Intercept)              0.234
# Abandoned_Cars           1.151
# Abandoned_Cars.nn        0.998
# abandoned.isSig.dist     0.999

# Interpretation:
# - Each additional abandoned car → 15.1% increase in expected burglaries
# - Each meter from the nearest abandoned car → 0.2% decrease in expected burglaries
```
The Overdispersion Problem
Poisson assumption: Variance = Mean
Reality with crime data: Variance > Mean (often MUCH larger)
Why overdispersion occurs:
Unobserved heterogeneity: Some areas have unmeasured crime attractors
Contagion effects: One crime leads to others (not independent)
Measurement error: Counting issues, data quality
Model misspecification: Missing important variables
Check for overdispersion:
\[\text{Dispersion} = \frac{\text{Residual Deviance}}{\text{Degrees of Freedom}}\]
If ≈ 1: Poisson is fine
If > 1: Overdispersion (common!)
If > 2-3: Serious overdispersion → Use Negative Binomial
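A minimal sketch of the check, using simulated overdispersed counts rather than the lab’s fishnet (all names and numbers here are hypothetical):

```r
# Simulate counts with true overdispersion, fit Poisson, compute dispersion
set.seed(123)
x <- runif(500)
y <- rnbinom(500, size = 0.8, mu = exp(0.5 + 1.2 * x))

fit <- glm(y ~ x, family = poisson)
dispersion <- fit$deviance / fit$df.residual
dispersion  # well above 1 here → overdispersion, so Negative Binomial is indicated
```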
The Negative Binomial Model
Adds a dispersion parameter \(\alpha\) to the variance:
\[\text{Var}(Y_i) = \mu_i + \alpha \mu_i^2\]
If \(\alpha = 0\): Reduces to Poisson
If \(\alpha > 0\): Allows extra variance (overdispersion)
Interpretation: Coefficients are interpreted the same way as in Poisson!
Negative Binomial in R
```r
library(MASS)

# Fit Negative Binomial model
model_nb <- glm.nb(
  countBurglaries ~ Abandoned_Cars + Abandoned_Cars.nn + abandoned.isSig.dist,
  data = fishnet
)

# View results
summary(model_nb)

# Compare to Poisson
AIC(model_poisson)  # e.g., 8234.5
AIC(model_nb)       # e.g., 6721.3
# Lower AIC = better fit
# If the NegBin AIC is much lower → use NegBin

# Extract dispersion parameter (theta)
model_nb$theta  # e.g., 2.47
# Interpretation: significant overdispersion confirmed
```
Comparing Poisson vs. Negative Binomial
| Aspect | Poisson | Negative Binomial |
|---|---|---|
| Variance assumption | Var = Mean | Var = μ + αμ² |
| Overdispersion | Cannot handle | Accommodates |
| Standard errors | Underestimated if overdispersed | Correctly estimated |
| When to use | Count data, no overdispersion | Count data with overdispersion |
| Crime data | Rarely appropriate | Usually better |
For today’s lab: We’ll fit both, compare, and use the better model
Model Diagnostics for Count Models
Unlike OLS, we don’t use residual plots the same way
Key diagnostics:
Dispersion test (already covered)
Deviance residuals: Should be roughly normal
Pearson residuals: Check for outliers
Cook’s distance: Influential observations
Predicted vs. observed: Visual check
```r
# Deviance residuals
plot(model_nb, which = 1)  # Residuals vs. Fitted

# Identify outliers
outliers <- which(abs(residuals(model_nb, type = "deviance")) > 3)

# Influential observations (Cook's distance)
influence <- cooks.distance(model_nb)
influential <- which(influence > 4 / length(influence))
```
Choosing a Spatial Unit
Administrative units (e.g., census tracts)
Con: Arbitrary, unequal sizes, Modifiable Areal Unit Problem
Fishnet grid (regular cells)
Pro: Consistent size, no boundary bias
Con: Arbitrary, may split “natural” areas
We use fishnet because:
Standard approach in predictive policing
Easier spatial operations
Consistent unit of analysis
Creating a Fishnet Grid
```r
library(sf)

# Step 1: Define cell size (in map units - meters for our projection)
cell_size <- 500  # 500m x 500m cells

# Step 2: Create grid over study area
fishnet <- st_make_grid(
  chicago_boundary,
  cellsize = cell_size,
  square = TRUE,   # set FALSE if you want to be fancy with hexagons
  what = "polygons"
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Step 3: Clip to study area (remove cells outside boundary)
fishnet <- fishnet[chicago_boundary, ]

# Check results
nrow(fishnet)          # Number of cells
st_area(fishnet[1, ])  # Area of one cell (should be 250,000 m²)
```
Grid Cell Size: A Critical Choice
Common sizes:
250m × 250m: Fine-grained, many cells, computationally intensive
500m × 500m: Standard, balance detail and computation
1000m × 1000m: Coarse, faster, loses local detail
Smaller cells:
✓ More spatial detail
✓ Better capture local patterns
✗ More zeros (sparse data)
✗ Computational cost
Larger cells:
✓ Fewer zeros
✓ More stable estimates
✗ Lose local variation
✗ May obscure hotspots
Choice affects results! No “correct” answer.
Aggregating Points to Grid
Process:
Spatial join between crimes (points) and fishnet (polygons)
Count crimes per cell
Handle cells with zero crimes
```r
# Count burglaries per cell
burglary_counts <- st_join(burglaries, fishnet) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

# Join back to fishnet
fishnet <- fishnet %>%
  left_join(burglary_counts, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

# Summary
summary(fishnet$countBurglaries)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#     0       0       1     2.3       3      47
```
Handling Zeros in Count Data
Crime data typically has MANY zeros
Example distribution:
40% of cells: 0 burglaries
30% of cells: 1 burglary
20% of cells: 2-3 burglaries
10% of cells: 4+ burglaries
Implications:
Poisson handles zeros naturally (built into distribution)
Zero-inflation: If >60% zeros, consider Zero-Inflated Poisson (ZIP)
For today: Standard Negative Binomial handles our zeros fine
Critical interpretation: Are zeros “true zeros” (no crime) or “missing data” (unreported)?
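A quick way to check the zero share on any count vector (simulated data here, not the lab’s fishnet):

```r
# Share of zero cells: a first check before reaching for zero-inflated models
set.seed(7)
counts <- rnbinom(1000, size = 0.7, mu = 1.2)  # simulated sparse cell counts
prop_zero <- mean(counts == 0)
prop_zero  # compare against the ~60% rule of thumb above
```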
Part 7E: Spatial Cross-Validation
Why Standard Cross-Validation Fails for Spatial Data
Standard k-fold CV:
Randomly split data into k folds
Train on k-1 folds, test on 1
Repeat k times
Problem with spatial data:
Nearby observations are correlated
Training set includes cells adjacent to test cells
Spatial leakage: Model learns from neighbors of test set
Overly optimistic performance estimates
Solution: Spatial cross-validation
Leave-One-Group-Out Cross-Validation (LOGO-CV)
Principle: Hold out entire spatial groups, not individual cells
Process:
Divide study area into groups (e.g., police districts)
Hold out all cells in District 1
Train model on Districts 2-N
Predict for District 1
Repeat for each district
Why better:
Tests generalization to truly new areas
No spatial leakage between train/test
More realistic deployment scenario
Conservative performance estimates
LOGO-CV Implementation
```r
# Get unique districts
districts <- unique(fishnet$District)

# Initialize results
cv_results <- list()

# Loop through districts
for (dist in districts) {
  # Split data
  train_data <- fishnet %>% filter(District != dist)
  test_data  <- fishnet %>% filter(District == dist)

  # Fit model on training data
  model_cv <- glm.nb(
    countBurglaries ~ Abandoned_Cars + Abandoned_Cars.nn + abandoned.isSig.dist,
    data = train_data
  )

  # Predict on test data
  test_data$prediction <- predict(model_cv, test_data, type = "response")

  # Store results (name by district so numeric IDs don't index by position)
  cv_results[[as.character(dist)]] <- test_data
}

# Combine all predictions
all_predictions <- bind_rows(cv_results)
```
Evaluating CV Performance
Common metrics for count models:
Mean Absolute Error (MAE):
\[MAE = \frac{1}{n}\sum_i |y_i - \hat{y}_i|\]
Root Mean Squared Error (RMSE):
\[RMSE = \sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}\]
Mean Error (Bias):
\[ME = \frac{1}{n}\sum_i (y_i - \hat{y}_i)\]
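The three formulas translate directly into one-line R functions. The vectors `y` and `yhat` below are hypothetical observed and predicted counts, standing in for the LOGO-CV output:

```r
# Minimal metric implementations for count predictions
mae  <- function(y, yhat) mean(abs(y - yhat))
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
bias <- function(y, yhat) mean(y - yhat)  # positive = underprediction on average

y    <- c(0, 2, 5, 1)  # toy observed counts
yhat <- c(1, 2, 3, 1)  # toy predicted counts
mae(y, yhat)   # 0.75
rmse(y, yhat)  # ≈ 1.118
bias(y, yhat)  # 0.25
```

RMSE penalizes large misses more heavily than MAE, and a nonzero bias flags systematic over- or under-prediction, which matters when predictions drive patrol allocation.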